Text Categorization and Classifications Based on Weka

AUTHORS

Sami Mohammed, Department of Computer Science, University of Victoria, 3800 Finnerty Road, Victoria, British Columbia V8W 3P6, Canada

ABSTRACT

As the volume of information available on the Internet and within corporations grows, so does interest in tools that help people find, filter, and manage these electronic resources. Text categorization, the assignment of natural-language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. This article is an introductory look at this vital research field within data mining. It is motivated by the author's interest in NLP research on analyzing biomedical documents such as discharge summaries. In the data collection stage, several discharge summaries were collected from i2b2, along with other text datasets that are well known in text categorization (such as 20-Newsgroups). The data preprocessing stage is a significant part of the project: it applies several NLP filters and converts the collected documents from plain text to ARFF, because our investigation found Weka to be the most suitable data mining framework for text analysis. The third stage experiments with several machine learning algorithms for text categorization and classification. Here we found that SMO (Sequential Minimal Optimization) was the classifier with the highest accuracy, even as the training sample size decreased or as the mix of document types in the sample increased.
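The text-to-ARFF conversion mentioned in the preprocessing stage can be illustrated with a minimal sketch. This is not the article's own code: the function names, the relation name, the attribute names `text` and `category`, and the two toy documents are all hypothetical, and the sketch assumes class labels are simple tokens without spaces or commas.

```python
def quote_arff(value: str) -> str:
    """Quote a string for an ARFF data row, escaping characters that would break it."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f"'{escaped}'"


def to_arff(relation: str, documents: list[tuple[str, str]]) -> str:
    """Build ARFF text with one string attribute and one nominal class attribute."""
    # Assumption: labels are simple tokens, so they need no quoting in the header.
    labels = sorted({label for _, label in documents})
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute category {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for text, label in documents:
        lines.append(f"{quote_arff(text)},{label}")
    return "\n".join(lines) + "\n"


# Hypothetical toy documents; not taken from i2b2 or 20-Newsgroups.
sample = [
    ("Patient discharged on 5 mg warfarin daily.", "clinical"),
    ("The team's new graphics card ships in May.", "newsgroup"),
]
arff_text = to_arff("toy_documents", sample)
print(arff_text)
```

A file produced this way holds each document as a single string attribute. In Weka, such a string attribute would typically still need to be turned into word features (for example with the StringToWordVector filter) before a classifier such as SMO can be trained on it.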

 

KEYWORDS

Text Categorization, Machine Learning, Weka API, Text Classifiers.

CITATION

  • APA:
    Mohammed, S. (2017). Text Categorization and Classifications Based on Weka. International Journal of Advanced Research in Big Data Management System, 1(2), 7-22. http://dx.doi.org/10.21742/IJARBMS.2017.1.2.02
  • Harvard:
    Mohammed, S. (2017). "Text Categorization and Classifications Based on Weka". International Journal of Advanced Research in Big Data Management System, 1(2), pp. 7-22. doi:http://dx.doi.org/10.21742/IJARBMS.2017.1.2.02
  • IEEE:
    [1] S. Mohammed, "Text Categorization and Classifications Based on Weka". International Journal of Advanced Research in Big Data Management System, vol. 1, no. 2, pp. 7-22, Dec. 2017
  • MLA:
    Mohammed, Sami. "Text Categorization and Classifications Based on Weka". International Journal of Advanced Research in Big Data Management System, vol. 1, no. 2, Dec. 2017, pp. 7-22, doi:http://dx.doi.org/10.21742/IJARBMS.2017.1.2.02

ISSUE INFO

  • Volume 1, No. 2, 2017
  • ISSN (p): 2208-1674
  • ISSN (o): 2208-1682
  • Published: Dec. 2017
